Starbucks Capstone Challenge:

Combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type


Business Understanding

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. The program used to create the data simulates how people make purchasing decisions and how those decisions are influenced by promotional offers. Once every few days, Starbucks sends out an offer to users of the mobile app.

This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products. Only the amounts of each transaction or offer are recorded.

There are three types of offers that can be sent:

  1. buy-one-get-one (BOGO): In a BOGO offer, a user needs to spend a certain amount to get a reward equal to that threshold amount.
  2. discount: In a discount, a user gains a reward equal to a fraction of the amount spent.
  3. informational: In an informational offer, there is no reward, but neither is there a requisite amount that the user is expected to spend.

Some users might not receive any offer during certain weeks, and not all users receive the same offer. Accounting for this uneven exposure is the challenge to solve with this data set.

Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable traits. People produce various events, including receiving offers, opening offers, and making purchases.

The goal is to combine transaction, demographic and offer data to:

determine which demographic groups respond best to which offer type


Data Understanding

load CLEANED DATA

to go directly to ANALYSIS

Exploring profile

profile.json : Rewards program users (17000 users x 5 fields)

Exploring portfolio

portfolio.json: Offers sent during 30-day test period (10 offers x 6 fields)

Every offer has a validity period before the offer expires:

Exploring transcript

Transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

transcript.json: Event log (306648 events x 4 fields)


Data Preparation

All the steps taken in this section are implemented in code.load_data().


Preparing profile

Filtering out participants with missing values:

By filtering on age alone we removed all missing values, which means the missing entries in the other fields belong to the same participants.
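The age filter can be sketched as follows. This is a minimal illustration on a toy table, not the code in code.load_data(): the column names and the placeholder value 118 for an unknown age are assumptions about how the profile data encodes missing entries.

```python
import numpy as np
import pandas as pd

# Toy stand-in for profile.json; column names are assumed for illustration.
profile = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "age": [34, 118, 61, 118],
    "gender": ["F", None, "M", None],
    "income": [52000.0, np.nan, 98000.0, np.nan],
})

MISSING_AGE = 118  # assumed placeholder used for an unknown age

# Drop every participant whose age is the placeholder value.
clean_profile = profile[profile["age"] != MISSING_AGE].reset_index(drop=True)

# Check that no missing value survives in any other column.
remaining_missing = int(clean_profile.isna().sum().sum())
print(len(clean_profile), remaining_missing)
```

Counting the remaining missing values after the filter is a cheap way to confirm that a single condition on age is enough.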

Preparing portfolio

Preparing transcript

Filtering the transcript to keep only the participants available in the profile dataset

Filtering out participants that did not receive any offer

Since we will use the offers received to measure the impact of each offer, we check that all participants received at least 1 offer.

Filtering out participants with no transactions

Since we will measure the impact of each offer through the transactions that occurred, we check that all participants made at least 1 transaction.

Removing participants with no offer received or no transactions made
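A minimal sketch of these three filters on a toy event log. The column names (person, event, time) and the event labels are assumptions made for illustration; the actual implementation lives in code.load_data().

```python
import pandas as pd

# Toy stand-in for the transcript event log.
transcript = pd.DataFrame({
    "person": ["a", "a", "b", "c", "c", "d"],
    "event": ["offer received", "transaction", "offer received",
              "transaction", "transaction", "offer received"],
    "time": [0, 6, 0, 12, 30, 0],
})
known_ids = {"a", "b", "c"}  # ids that survived the profile cleaning

# 1. Keep only events from participants present in the cleaned profile.
transcript = transcript[transcript["person"].isin(known_ids)]

# 2. Participants with at least one offer received, and
# 3. participants with at least one transaction.
received = set(transcript.loc[transcript["event"] == "offer received", "person"])
purchased = set(transcript.loc[transcript["event"] == "transaction", "person"])

# Keep participants who satisfy both conditions.
keep = received & purchased
transcript = transcript[transcript["person"].isin(keep)].reset_index(drop=True)
print(sorted(keep), len(transcript))
```

The same `keep` set can then be applied to the profile table so that both tables describe the same participants.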

Extracting data from the value dictionary and splitting it into amount, offer_id, and reward
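One way to do this split is sketched below on a toy log. The helper name `get_offer_id` is hypothetical, and the two spellings of the offer key ("offer id" and "offer_id") are an assumption about the raw dictionaries; adjust them to the keys actually present.

```python
import pandas as pd

# Toy event log with a dictionary-valued "value" column.
transcript = pd.DataFrame({
    "person": ["a", "a", "a", "a"],
    "event": ["offer received", "offer viewed", "transaction", "offer completed"],
    "time": [0, 6, 12, 12],
    "value": [
        {"offer id": "o1"},
        {"offer id": "o1"},
        {"amount": 12.5},
        {"offer_id": "o1", "reward": 5},
    ],
})

def get_offer_id(value):
    """Return the offer id whichever of the two assumed key spellings is used."""
    return value.get("offer id", value.get("offer_id"))

# dict.get returns None for absent keys, which pandas treats as missing.
transcript["offer_id"] = transcript["value"].apply(get_offer_id)
transcript["amount"] = transcript["value"].apply(lambda v: v.get("amount"))
transcript["reward"] = transcript["value"].apply(lambda v: v.get("reward"))
transcript = transcript.drop(columns="value")
print(transcript)
```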


Data Analysis

In order to analyze the data we created two classes (see code.starbucks_class):


Visualizing the time-line of events for one individual

This visualization represents all the events that occurred over the 30 days (720 hours) of data collection for one participant.

Each row represents an offer received by this individual. The light blue block represents the duration of the offer. In this example, the participant received 4 offers: 3 bogo and 1 informational.

For each offer, 4 different events are marked:

The 7 transactions for this individual are shown across the offers as black dotted lines. Transactions are not tied to a specific offer, and one difficulty of this analysis was assigning transactions to specific offers.

A few considerations:

Offers can overlap: offer #4 starts during offer #3. In this situation, transactions 6 and 7 could be assigned to both offer #3 and offer #4.

In this example, all offers have been viewed. This is not always the case. Moreover, transactions can be made before the offer is viewed (see transactions 6 and 7 for offer #4).

Transactions can be made before, during, or after an offer. We can find transactions between 2 offers (see transaction 5).

Not all offer types are presented to all participants. This leads to differences in the overall number of offers in each category.

The number of offers presented to an individual is not always 4 (see further analysis for details).

We decided to assign a transaction to an offer (marked with white dot) if:

Creating metrics for analysis


DEPENDENT VARIABLE

(TO BE EDITED)


SAVE RESULTS & UPDATED PROFILE


LOADING RESULTS

back to import

to model


TARGET & FEATURES

TARGETS:

Who is responding to the offers?

In this section we look at who is viewing which offers. We first try to identify whether a part of the population does not respond to all offers or to some specific offers.

The near-equal rates of missing data ($\approx 63\%$) are probably a sign that this dataset is simulated.

Overall, each offer type shows the same rate of response: $\approx 97\%$ of the offers of each type are viewed.

$88\%$ of the participants viewed all the offers presented to them, and over $99\%$ viewed more than $66\%$ of the offers presented.

All customers viewed at least $50\%$ of the offers presented to them.

We conclude that we do not have a response problem with any offer type.
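The view rates quoted above can be computed along the following lines. This is a sketch on a toy table with one row per offer sent; the column names (person, offer_type, viewed) are assumptions, and the toy numbers are not the results of the analysis.

```python
import pandas as pd

# One row per offer sent to a customer, with a flag for whether it was viewed.
offers = pd.DataFrame({
    "person": ["a", "a", "b", "b", "b", "c"],
    "offer_type": ["bogo", "discount", "bogo", "informational", "discount", "bogo"],
    "viewed": [True, True, True, False, True, True],
})

# Share of offers viewed, per offer type.
view_rate_by_type = offers.groupby("offer_type")["viewed"].mean()

# Share of offers viewed, per customer.
view_rate_by_person = offers.groupby("person")["viewed"].mean()

# Share of customers who viewed every offer presented to them.
share_viewed_all = (view_rate_by_person == 1.0).mean()

print(view_rate_by_type)
print(round(share_viewed_all, 3))
```

Taking the mean of a boolean column gives the proportion of True values, so no explicit counting is needed.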

Who is converting which offer?

Looking at the offers completed after being viewed, I computed the average completion rate per offer type for each customer (more than one offer type could be presented to each customer).

For the rest of this analysis we will define success as a completion rate of $50\%$ or higher.
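A minimal sketch of this metric, assuming a toy table with one row per offer sent and hypothetical boolean columns viewed and completed:

```python
import pandas as pd

# One row per offer sent, with view and completion flags.
offers = pd.DataFrame({
    "person": ["a", "a", "a", "b", "b", "c"],
    "offer_type": ["bogo", "bogo", "discount", "bogo", "discount", "discount"],
    "viewed": [True, True, True, True, True, False],
    "completed": [True, False, True, False, True, True],
})

# Only offers that were viewed count toward the completion rate.
viewed = offers[offers["viewed"]]

# Average completion rate per customer and offer type (customers x offer types).
completion_rate = (
    viewed.groupby(["person", "offer_type"])["completed"]
    .mean()
    .unstack("offer_type")
)

# Success: completion rate of 50% or higher.
success = completion_rate >= 0.5
print(completion_rate)
print(success)
```

Customers who viewed no offer (customer c here) do not appear in the table. An offer type never viewed by a customer would show up as a missing rate, which the comparison treats as not successful, so it may be worth handling those cases explicitly.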


JOINING FEATURES AND TARGETS


CORRELATION MATRIX

We originally recorded the number of offers and the total spending because we thought that the former might influence the latter. Yet these 2 variables are only very weakly correlated ($r=0.09$).

However, of all the variables, total_spending shows the highest correlation coefficient with the completion of the bogo and discount offers.
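For reference, a correlation matrix of this kind can be obtained directly from a results table. The sketch below uses made-up values and a hypothetical bogo_completed column; it assumes Pearson's coefficient, which the text does not state explicitly.

```python
import pandas as pd

# Toy results table: one row per customer (values are illustrative only).
results = pd.DataFrame({
    "total_offers": [3, 4, 5, 4, 6, 3],
    "total_spending": [20.0, 150.0, 35.0, 80.0, 60.0, 110.0],
    "bogo_completed": [0, 1, 0, 1, 0, 1],
})

# Pairwise Pearson correlation between all numeric columns.
corr = results.corr(method="pearson")

# Coefficient between the number of offers and the total spending.
r_offers_spending = corr.loc["total_offers", "total_spending"]
print(corr.round(2))
```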


$INCOME = f(AGE)$

From this representation we can clearly identify 2 income breaks, at 75k and 100k. In both the bogo and the discount completed-offer groups we can identify a $3^{rd}$ income break at 50k. We will therefore split the income into 4 brackets:

  1. 30k - 50k,
  2. 50k - 75k,
  3. 75k - 100k
  4. > 100k

Similarly, we can identify 3 age brackets:

  1. \< 36,
  2. 36 - 48,
  3. 48 - 75 (this last age break could be moved up to 80 years old)
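The income and age brackets above can be turned into categorical features with `pd.cut`. This is a sketch on toy values; which bracket a value exactly on a break belongs to is not specified in the text, so the left-closed intervals used here are an assumption.

```python
import numpy as np
import pandas as pd

# Toy profile values chosen to land in each bracket.
profile = pd.DataFrame({
    "age": [22, 36, 47, 60, 74],
    "income": [31000, 50000, 74000, 99000, 118000],
})

income_bins = [30_000, 50_000, 75_000, 100_000, np.inf]
income_labels = ["30k-50k", "50k-75k", "75k-100k", ">100k"]
age_bins = [0, 36, 48, 75]
age_labels = ["<36", "36-48", "48-75"]

# right=False makes intervals closed on the left, e.g. [50k, 75k).
# Values outside the bins (such as age 75 or more) become missing.
profile["income_bracket"] = pd.cut(
    profile["income"], bins=income_bins, labels=income_labels, right=False
)
profile["age_bracket"] = pd.cut(
    profile["age"], bins=age_bins, labels=age_labels, right=False
)
print(profile)
```

If the last age break is moved up to 80, only the final entry of `age_bins` needs to change.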

$AGE = f(MEMBERSHIP)$

This representation shows 3 clear periods:

  1. start - August 2015
  2. August 2015 - August 2017
  3. August 2017 - end
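A small sketch of how each customer could be labeled with one of these 3 periods. It assumes the membership date is stored as an integer of the form YYYYMMDD in a column named became_member_on, and that each break falls on the first day of August; neither detail is stated in the text.

```python
import pandas as pd

# Toy membership dates stored as YYYYMMDD integers (assumed format).
profile = pd.DataFrame({
    "became_member_on": [20140305, 20150801, 20161224, 20170731, 20180110],
})
member_date = pd.to_datetime(profile["became_member_on"].astype(str), format="%Y%m%d")

break_1 = pd.Timestamp("2015-08-01")  # assumed exact day of the first break
break_2 = pd.Timestamp("2017-08-01")  # assumed exact day of the second break

# Period 1 before the first break, 2 between the breaks, 3 from the second break on.
profile["member_period"] = (
    1 + (member_date >= break_1).astype(int) + (member_date >= break_2).astype(int)
)
print(profile)
```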

It looks like women of all ages who became members during period 2 are very likely to complete the bogo and the discount offers.


$INCOME = f(MEMBERSHIP)$


TOTAL SPENDING PER GROUP


RESULTS TABLE


TOP 10 conversion by Spending


TOP 10 conversion by bogo


TOP 10 conversion for max difference

Find any individual profile


Model

load Results

Building on the conversion of the bogo and discount offers, I am testing a continuous metric that would be better at quantifying the performance of each offer type. I am using amount_viewed, which is the sum of all transactions that occurred from the moment an offer is viewed until the end of the duration of that offer.
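The definition of amount_viewed can be sketched as follows on toy data. The table layout and column names (received_time, viewed_time, duration_days) are assumptions; times are in hours since the start of the test, the offer duration is assumed to be in days, and the offer is assumed to run from the moment it is received.

```python
import pandas as pd

# One row per offer sent; viewed_time is missing when the offer was never viewed.
offers = pd.DataFrame({
    "person": ["a", "a", "b"],
    "offer_id": ["o1", "o2", "o1"],
    "received_time": [0, 168, 0],
    "viewed_time": [6.0, 180.0, float("nan")],
    "duration_days": [7, 5, 7],
})
transactions = pd.DataFrame({
    "person": ["a", "a", "a", "a", "b"],
    "time": [0, 12, 150, 200, 24],
    "amount": [5.0, 10.0, 7.5, 20.0, 9.0],
})

# End of the offer, in hours.
offers["end_time"] = offers["received_time"] + 24 * offers["duration_days"]

def amount_viewed(offer):
    """Sum this customer's transactions between viewing and the end of the offer."""
    if pd.isna(offer["viewed_time"]):
        return 0.0  # offer never viewed: nothing is attributed to it
    in_window = (
        (transactions["person"] == offer["person"])
        & (transactions["time"] >= offer["viewed_time"])
        & (transactions["time"] <= offer["end_time"])
    )
    return float(transactions.loc[in_window, "amount"].sum())

offers["amount_viewed"] = offers.apply(amount_viewed, axis=1)
print(offers[["person", "offer_id", "amount_viewed"]])
```

With this definition, a transaction that falls inside the windows of two overlapping offers counts toward both of them.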

Filtering

First, I try to make our population more homogeneous by checking for outliers in total_spending and total_offers (the total number of offers received by each customer).

After looking at these distributions I set:

I then keep in our results dataframe only the customers present in our filtered group.

Testing Continuous Metric - amount_viewed

In this section I test whether amount_viewed is indeed sensitive to the metric of success used in the previous analysis: the conversion rate.

The t-test comparing amount_viewed in the unsuccessful group and the successful group clearly rejects the null hypothesis (p < 0.05). The unsuccessful group shows significantly lower spending after viewing an offer than the successful group.
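A comparison of this kind can be run with `scipy.stats.ttest_ind`. The sketch below uses synthetic data in place of the real amount_viewed values, and it uses Welch's variant (unequal variances), which is my assumption since the text does not say which variant was used.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed spending values standing in for amount_viewed.
rng = np.random.default_rng(0)
unsuccessful = rng.gamma(shape=2.0, scale=5.0, size=200)   # lower spending
successful = rng.gamma(shape=2.0, scale=15.0, size=200)    # higher spending

# Welch's two-sample t-test; the null hypothesis is that both means are equal.
t_stat, p_value = stats.ttest_ind(unsuccessful, successful, equal_var=False)

reject_null = p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject null: {reject_null}")
```

A negative t statistic here means the first group (unsuccessful) has the lower mean.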

I will use the amount_viewed to build a model of performance for each offer type based on the demographics and the total_spending

Preparing Target & Features

To make the model more accurate I decided to:

Testing Model with different normalizations

The distribution of amount_viewed for the discount offer is clearly skewed, so I explored transforming it to bring it closer to a normal distribution.

I will test a linear regression model on the 3 distributions
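As an illustration of this comparison, the sketch below fits an ordinary least squares model to a raw target and to two transformed versions of it. The text does not name the 3 distributions, so the choice of raw, log1p, and square-root targets is an assumption, as are the synthetic features; the fit uses plain NumPy rather than a modeling library.

```python
import numpy as np

# Synthetic features: age, income (in thousands), total_spending.
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([
    rng.uniform(18, 80, n),
    rng.uniform(30, 120, n),
    rng.gamma(2.0, 40.0, n),
])
# Synthetic right-skewed, non-negative target standing in for amount_viewed.
noise = rng.normal(0, 0.3, n)
y = np.expm1(0.01 * X[:, 2] + 0.005 * X[:, 1] + noise).clip(min=0)

def r_squared(features, target):
    """In-sample R^2 of a least-squares linear fit with an intercept."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residuals = target - A @ coef
    return 1 - residuals.var() / target.var()

# Assumed candidate targets: raw, log-transformed, square-root-transformed.
targets = {"raw": y, "log1p": np.log1p(y), "sqrt": np.sqrt(y)}
scores = {name: r_squared(X, t) for name, t in targets.items()}
print({name: round(score, 3) for name, score in scores.items()})
```

Each score is measured on the scale of its own transformed target, so comparing them says how well a linear model fits each version, not how accurate predictions are in the original units.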

Model Evaluation